This paper deals with association rule analysis. A common example is customer purchase analysis, which attempts to predict whether a consumer will buy product(s) Y given that they buy product(s) X.
The data for the previous research was taken from the Kaggle platform: https://www.kaggle.com/gorkhachatryan01/purchase-behaviour
The original paper comes from my ML project; I would like to improve it through Reproducible Research.
To check reproducibility, I apply the same association rule analysis to a different dataset: https://www.kaggle.com/roshansharma/market-basket-optimization/version/1
library(kableExtra)
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
transactions = read.transactions(
"Market_Basket_Optimisation.csv",
format = "basket",
sep = ",",
skip = 0,
header = TRUE
)
transactions
## transactions in sparse format with
## 7500 transactions (rows) and
## 119 items (columns)
itemFrequencyPlot(
transactions,
topN = 20,
type = "absolute",
main = "Item frequency",
cex.names = 0.85
)
The figure above shows the twenty most popular purchases. Mineral water comes first, followed by eggs, spaghetti, french fries and chocolate.
We start the analysis by creating rules; to do this I will use the Apriori algorithm. Because the algorithm did not find enough rules with the default values of support and confidence, I lowered them to 0.01 (support) and 0.4 (confidence). With these settings, the algorithm found 17 rules.
rules = apriori(transactions, parameter = list(supp = 0.01, conf = 0.40))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 75
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7500 transaction(s)] done [0.00s].
## sorting and recoding items ... [75 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [17 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
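The line "Absolute minimum support count: 75" in the log above follows directly from the chosen support threshold: it is the minimum support expressed as an absolute number of transactions. A minimal sketch, using the dataset size reported above:

```r
# apriori() converts the relative support threshold into an absolute
# count of transactions that a rule must reach to be kept.
n_transactions <- 7500  # number of baskets (see the transactions summary)
min_support    <- 0.01  # supp value passed to apriori()

min_support * n_transactions
# 75 -- every mined rule must be supported by at least 75 of the 7500 baskets
```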
Association rule analysis is a technique to uncover how items are associated with each other.
There are three common measures of association: support, confidence and lift.
Some examples are available in my Git repository: https://github.com/wzs19961101/final-projects-.git
Support is a measure of how often a certain subset of products appears in the whole set of transactions. In other words, it is the probability that a transaction contains all of the items together. Below are the top six rules in terms of support.
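As a sanity check, the support of the top rule in this section can be recomputed by hand from its count: 307 of the 7,500 baskets contain both ground beef and mineral water (values taken from the `inspect()` output below).

```r
# supp(X => Y) = number of baskets containing all items of the rule / N
n_baskets  <- 7500
rule_count <- 307   # baskets with both ground beef and mineral water

round(rule_count / n_baskets, 7)
# 0.0409333, the support reported for {ground beef} => {mineral water}
```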
rules_supp = sort(rules, by = "support", decreasing = TRUE)
rules_supp_table = inspect(head(rules_supp), linebreak = FALSE)
## lhs rhs support confidence
## [1] {ground beef} => {mineral water} 0.04093333 0.4165536
## [2] {olive oil} => {mineral water} 0.02746667 0.4178499
## [3] {soup} => {mineral water} 0.02306667 0.4564644
## [4] {ground beef,spaghetti} => {mineral water} 0.01706667 0.4353741
## [5] {ground beef,mineral water} => {spaghetti} 0.01706667 0.4169381
## [6] {chocolate,spaghetti} => {mineral water} 0.01586667 0.4047619
## coverage lift count
## [1] 0.09826667 1.748266 307
## [2] 0.06573333 1.753707 206
## [3] 0.05053333 1.915771 173
## [4] 0.03920000 1.827256 128
## [5] 0.04093333 2.394361 128
## [6] 0.03920000 1.698777 119
rules_supp_table %>%
kable() %>%
kable_styling()
|  | lhs |  | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {ground beef} | => | {mineral water} | 0.0409333 | 0.4165536 | 0.0982667 | 1.748266 | 307 |
| [2] | {olive oil} | => | {mineral water} | 0.0274667 | 0.4178499 | 0.0657333 | 1.753707 | 206 |
| [3] | {soup} | => | {mineral water} | 0.0230667 | 0.4564644 | 0.0505333 | 1.915771 | 173 |
| [4] | {ground beef,spaghetti} | => | {mineral water} | 0.0170667 | 0.4353741 | 0.0392000 | 1.827256 | 128 |
| [5] | {ground beef,mineral water} | => | {spaghetti} | 0.0170667 | 0.4169381 | 0.0409333 | 2.394361 | 128 |
| [6] | {chocolate,spaghetti} | => | {mineral water} | 0.0158667 | 0.4047619 | 0.0392000 | 1.698777 | 119 |
The rule with the highest support value (around 4%) means that 307 out of 7,500 transactions contained both ground beef and mineral water. The second rule means that olive oil and mineral water appeared together in 2.7% of transactions, and the third that soup and mineral water appeared together in 2.3% of transactions.
Confidence is a measure of how likely it is that a consumer buys product Y (rhs) given that their basket contains product(s) X (lhs). More formally, it is the estimated conditional probability of seeing Y in a transaction given that the transaction also contains X.
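This conditional probability can be recomputed from the quantities reported in this section: a rule's confidence is its support divided by the support of its lhs (reported as coverage). A quick check for the top rule:

```r
# conf(X => Y) = supp(X and Y) / supp(X)
supp_rule <- 0.01013333  # support of {eggs, ground beef, mineral water}
supp_lhs  <- 0.02        # coverage of {eggs, ground beef}

round(supp_rule / supp_lhs, 7)
# 0.5066665, agreeing (up to rounding) with the reported 0.5066667
```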
rules_conf = sort(rules, by = "confidence", decreasing = TRUE)
rules_conf_table = inspect(head(rules_conf), linebreak = FALSE)
## lhs rhs support confidence
## [1] {eggs,ground beef} => {mineral water} 0.01013333 0.5066667
## [2] {ground beef,milk} => {mineral water} 0.01106667 0.5030303
## [3] {chocolate,ground beef} => {mineral water} 0.01093333 0.4739884
## [4] {frozen vegetables,milk} => {mineral water} 0.01106667 0.4689266
## [5] {soup} => {mineral water} 0.02306667 0.4564644
## [6] {pancakes,spaghetti} => {mineral water} 0.01146667 0.4550265
## coverage lift count
## [1] 0.02000000 2.126469 76
## [2] 0.02200000 2.111207 83
## [3] 0.02306667 1.989319 82
## [4] 0.02360000 1.968075 83
## [5] 0.05053333 1.915771 173
## [6] 0.02520000 1.909736 86
rules_conf_table %>%
kable() %>%
kable_styling()
|  | lhs |  | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {eggs,ground beef} | => | {mineral water} | 0.0101333 | 0.5066667 | 0.0200000 | 2.126469 | 76 |
| [2] | {ground beef,milk} | => | {mineral water} | 0.0110667 | 0.5030303 | 0.0220000 | 2.111207 | 83 |
| [3] | {chocolate,ground beef} | => | {mineral water} | 0.0109333 | 0.4739884 | 0.0230667 | 1.989319 | 82 |
| [4] | {frozen vegetables,milk} | => | {mineral water} | 0.0110667 | 0.4689266 | 0.0236000 | 1.968074 | 83 |
| [5] | {soup} | => | {mineral water} | 0.0230667 | 0.4564644 | 0.0505333 | 1.915771 | 173 |
| [6] | {pancakes,spaghetti} | => | {mineral water} | 0.0114667 | 0.4550265 | 0.0252000 | 1.909736 | 86 |
The confidence values of the six rules presented are quite similar (from about 46% to 51%). Let’s analyze only the basket with the highest confidence.
Its confidence value says that a consumer who buys eggs and ground beef will, with a probability of about 51%, also buy mineral water.
Lift can be understood as a kind of correlation measure. Put simply, it indicates whether products X and Y are bought together more or less often than would be expected if they were independent.
A value greater than one indicates that the products tend to be bought together; a value less than one indicates that they tend to be bought separately.
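In formula terms, lift divides a rule's confidence by the baseline support of its rhs. The support of {mineral water} alone is not printed anywhere in the output; the value used below is back-derived from the reported confidence and lift, so treat it as approximate:

```r
# lift(X => Y) = conf(X => Y) / supp(Y)
conf_rule <- 0.5066667  # confidence of {eggs, ground beef} => {mineral water}
supp_rhs  <- 0.2382667  # supp({mineral water}), back-derived as conf / lift

round(conf_rule / supp_rhs, 6)
# 2.126469, the lift reported for that rule
```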
rules_lift = sort(rules, by = "lift", decreasing = TRUE)
rules_lift_table = inspect(head(rules_lift), linebreak = FALSE)
## lhs rhs support confidence
## [1] {ground beef,mineral water} => {spaghetti} 0.01706667 0.4169381
## [2] {eggs,ground beef} => {mineral water} 0.01013333 0.5066667
## [3] {ground beef,milk} => {mineral water} 0.01106667 0.5030303
## [4] {chocolate,ground beef} => {mineral water} 0.01093333 0.4739884
## [5] {frozen vegetables,milk} => {mineral water} 0.01106667 0.4689266
## [6] {soup} => {mineral water} 0.02306667 0.4564644
## coverage lift count
## [1] 0.04093333 2.394361 128
## [2] 0.02000000 2.126469 76
## [3] 0.02200000 2.111207 83
## [4] 0.02306667 1.989319 82
## [5] 0.02360000 1.968075 83
## [6] 0.05053333 1.915771 173
rules_lift_table %>%
kable() %>%
kable_styling()
|  | lhs |  | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {ground beef,mineral water} | => | {spaghetti} | 0.0170667 | 0.4169381 | 0.0409333 | 2.394361 | 128 |
| [2] | {eggs,ground beef} | => | {mineral water} | 0.0101333 | 0.5066667 | 0.0200000 | 2.126469 | 76 |
| [3] | {ground beef,milk} | => | {mineral water} | 0.0110667 | 0.5030303 | 0.0220000 | 2.111207 | 83 |
| [4] | {chocolate,ground beef} | => | {mineral water} | 0.0109333 | 0.4739884 | 0.0230667 | 1.989319 | 82 |
| [5] | {frozen vegetables,milk} | => | {mineral water} | 0.0110667 | 0.4689266 | 0.0236000 | 1.968074 | 83 |
| [6] | {soup} | => | {mineral water} | 0.0230667 | 0.4564644 | 0.0505333 | 1.915771 | 173 |
Analyzing the values of the top six rules, we can see that all of their lift values are higher than one. So we can conclude that the rhs products are more likely to be bought together with the lhs products than if they were independent. For the {ground beef, mineral water} => {spaghetti} rule, the items were seen together 2.39 times more often than would be expected if they were independent.
plot(rules, engine="plotly")
Let’s also look at the graph showing the mined rules relative to support (horizontal axis), confidence (vertical axis) and lift (color saturation). Most of the values are arranged in a hyperbolic shape, suggesting that as confidence increases, support decreases. This is partly a consequence of how the measures are calculated, but it makes some outliers, such as {soup} => {mineral water}, easier to spot.
Let’s say we want to look only at chocolate as our rhs; in simple terms, we want to find out which products are usually bought together with (or before) chocolate.
rules_chocolate = apriori(
data = transactions,
parameter = list(supp = 0.001, conf = 0.7),
appearance = list(default = "lhs", rhs = "chocolate"),
control = list(verbose = F)
)
rules_chocolate_table = inspect(rules_chocolate, linebreak = FALSE)
## lhs rhs
## [1] {red wine,tomato sauce} => {chocolate}
## [2] {almonds,olive oil,spaghetti} => {chocolate}
## [3] {almonds,milk,spaghetti} => {chocolate}
## [4] {escalope,french fries,shrimp} => {chocolate}
## [5] {burgers,olive oil,pancakes} => {chocolate}
## [6] {frozen vegetables,mineral water,pancakes,shrimp} => {chocolate}
## support confidence coverage lift count
## [1] 0.001066667 0.8000000 0.001333333 4.882018 8
## [2] 0.001066667 0.7272727 0.001466667 4.438198 8
## [3] 0.001066667 0.7272727 0.001466667 4.438198 8
## [4] 0.001066667 0.8888889 0.001200000 5.424464 8
## [5] 0.001200000 0.7500000 0.001600000 4.576892 9
## [6] 0.001066667 0.7272727 0.001466667 4.438198 8
rules_chocolate_table %>%
kable() %>%
kable_styling()
|  | lhs |  | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {red wine,tomato sauce} | => | {chocolate} | 0.0010667 | 0.8000000 | 0.0013333 | 4.882018 | 8 |
| [2] | {almonds,olive oil,spaghetti} | => | {chocolate} | 0.0010667 | 0.7272727 | 0.0014667 | 4.438198 | 8 |
| [3] | {almonds,milk,spaghetti} | => | {chocolate} | 0.0010667 | 0.7272727 | 0.0014667 | 4.438198 | 8 |
| [4] | {escalope,french fries,shrimp} | => | {chocolate} | 0.0010667 | 0.8888889 | 0.0012000 | 5.424464 | 8 |
| [5] | {burgers,olive oil,pancakes} | => | {chocolate} | 0.0012000 | 0.7500000 | 0.0016000 | 4.576892 | 9 |
| [6] | {frozen vegetables,mineral water,pancakes,shrimp} | => | {chocolate} | 0.0010667 | 0.7272727 | 0.0014667 | 4.438198 | 8 |
plot(rules_chocolate, engine="plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
From this project we can see that association rules are an interesting method of data analysis that can relatively easily reveal many interesting relationships. In addition, I carried out Reproducible Research by applying the same methods to another dataset, which demonstrates the reproducibility of my code.
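The reproduction below repeats the same pipeline on a second dataset. One way to make that repetition explicit is to factor the shared steps into a helper function; this is only a sketch, with the file names and thresholds used elsewhere in this paper:

```r
library(arules)

# Shared pipeline for both analyses: read a basket-format CSV,
# mine association rules with apriori(), and sort them by lift.
mine_rules <- function(file, supp, conf) {
  trans <- read.transactions(file, format = "basket",
                             sep = ",", skip = 0, header = TRUE)
  rules <- apriori(trans,
                   parameter = list(supp = supp, conf = conf),
                   control = list(verbose = FALSE))
  sort(rules, by = "lift", decreasing = TRUE)
}

# rules_1 <- mine_rules("Market_Basket_Optimisation.csv", supp = 0.01, conf = 0.4)
# rules_2 <- mine_rules("dataset.csv", supp = 0.10, conf = 0.4)
```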
library(arules)
library(arulesViz)
setwd("C:/Users/wangz/Desktop")
md = read.transactions("dataset.csv",format = "basket",
sep = ",",skip = 0, header = TRUE)
dim(md)
## [1] 1498 38
# average number of items per transaction
ave_size = mean(size(md))
ave_size
## [1] 10.34913
summary(md)
## transactions as itemMatrix in sparse format with
## 1498 rows (elements/itemsets/transactions) and
## 38 columns (items) and a density of 0.2723456
##
## most frequent items:
## vegetables poultry waffles bagels lunch meat (Other)
## 894 431 418 417 413 12930
##
## element (itemset/transaction) length distribution:
## sizes
## 3 4 5 6 7 8 9 10 11 12 13 14
## 8 57 51 51 71 74 95 191 304 320 212 64
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 11.00 10.35 12.00 14.00
##
## includes extended item information - examples:
## labels
## 1 all- purpose
## 2 aluminum foil
## 3 bagels
Let’s check which products appear most and least often, and visualize the item frequencies.
# relative frequency
round(itemFrequency(md, type="relative"),4)
## all- purpose aluminum foil
## 0.2630 0.2637
## bagels beef
## 0.2784 0.2623
## butter cereals
## 0.2610 0.2737
## cheeses coffee/tea
## 0.2603 0.2630
## dinner rolls dishwashing liquid/detergent
## 0.2583 0.2684
## eggs flour
## 0.2690 0.2570
## fruits hand soap
## 0.2637 0.2377
## ice cream individual meals
## 0.2750 0.2717
## juice ketchup
## 0.2577 0.2503
## laundry detergent lunch meat
## 0.2644 0.2757
## milk mixes
## 0.2710 0.2737
## paper towels pasta
## 0.2550 0.2717
## pork poultry
## 0.2497 0.2877
## sandwich bags sandwich loaves
## 0.2497 0.2490
## shampoo soap
## 0.2477 0.2657
## soda spaghetti sauce
## 0.2737 0.2543
## sugar toilet paper
## 0.2670 0.2704
## tortillas vegetables
## 0.2443 0.5968
## waffles yogurt
## 0.2790 0.2684
# plot for relative frequency
itemFrequencyPlot(
md,
topN = 10,
type = "relative",
main = "Item frequency",
cex.names = 0.85
)
#absolute frequency
itemFrequency(md, type="absolute")
## all- purpose aluminum foil
## 394 395
## bagels beef
## 417 393
## butter cereals
## 391 410
## cheeses coffee/tea
## 390 394
## dinner rolls dishwashing liquid/detergent
## 387 402
## eggs flour
## 403 385
## fruits hand soap
## 395 356
## ice cream individual meals
## 412 407
## juice ketchup
## 386 375
## laundry detergent lunch meat
## 396 413
## milk mixes
## 406 410
## paper towels pasta
## 382 407
## pork poultry
## 374 431
## sandwich bags sandwich loaves
## 374 373
## shampoo soap
## 371 398
## soda spaghetti sauce
## 410 381
## sugar toilet paper
## 400 405
## tortillas vegetables
## 366 894
## waffles yogurt
## 418 402
#plot for absolute frequency
itemFrequencyPlot(
md,
topN = 10,
type = "absolute",
main = "Item frequency",
cex.names = 0.85
)
The figure above shows the ten most popular purchases. Vegetables come first, followed by poultry and waffles.
#Plot for min support
itemFrequencyPlot(md, support = 0.1) #minimum support at 10%
Association rules
Association rule analysis is a technique to uncover how items are associated with each other. There are three common measures of association: support, confidence and lift.
Global rules calculations
I use the Apriori algorithm. To simplify the analysis, I used confidence = 0.4 and support = 0.1. With these settings, the algorithm found 38 rules.
rules = apriori(md, parameter = list(supp = 0.1, conf = 0.4))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 149
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[38 item(s), 1498 transaction(s)] done [0.00s].
## sorting and recoding items ... [38 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [38 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Support
Support is a measure of how often a certain subset of items appears in the whole dataset.
rules_supp = sort(rules, by = "support", decreasing = TRUE)
rules_supp_table = inspect(head(rules_supp), linebreak = FALSE)
## lhs rhs support confidence coverage lift
## [1] {} => {vegetables} 0.5967957 0.5967957 1.0000000 1.000000
## [2] {yogurt} => {vegetables} 0.1762350 0.6567164 0.2683578 1.100404
## [3] {poultry} => {vegetables} 0.1748999 0.6078886 0.2877170 1.018587
## [4] {laundry detergent} => {vegetables} 0.1728972 0.6540404 0.2643525 1.095920
## [5] {lunch meat} => {vegetables} 0.1715621 0.6222760 0.2757009 1.042695
## [6] {cereals} => {vegetables} 0.1702270 0.6219512 0.2736983 1.042151
## count
## [1] 894
## [2] 264
## [3] 262
## [4] 259
## [5] 257
## [6] 255
Confidence
Confidence is a measure of how likely it is that a consumer buys product Y (rhs) given that their basket contains product(s) X (lhs). More formally, it is the estimated conditional probability of seeing Y in a transaction given that the transaction also contains X.
rules_conf = sort(rules, by = "confidence", decreasing = TRUE)
rules_conf_table = inspect(head(rules_conf), linebreak = FALSE)
## lhs rhs support confidence coverage lift
## [1] {yogurt} => {vegetables} 0.1762350 0.6567164 0.2683578 1.100404
## [2] {laundry detergent} => {vegetables} 0.1728972 0.6540404 0.2643525 1.095920
## [3] {eggs} => {vegetables} 0.1695594 0.6302730 0.2690254 1.056095
## [4] {lunch meat} => {vegetables} 0.1715621 0.6222760 0.2757009 1.042695
## [5] {cereals} => {vegetables} 0.1702270 0.6219512 0.2736983 1.042151
## [6] {flour} => {vegetables} 0.1595461 0.6207792 0.2570093 1.040187
## count
## [1] 264
## [2] 259
## [3] 254
## [4] 257
## [5] 255
## [6] 239
Lift
Lift can be understood as a kind of correlation measure. Put simply, it indicates whether products X and Y are bought together more or less often than would be expected if they were independent. A value greater than one indicates that the products tend to be bought together; a value less than one indicates that they tend to be bought separately.
rules_lift = sort(rules, by = "lift", decreasing = TRUE)
rules_lift_table = inspect(head(rules_lift), linebreak = FALSE)
## lhs rhs support confidence coverage lift
## [1] {yogurt} => {vegetables} 0.1762350 0.6567164 0.2683578 1.100404
## [2] {laundry detergent} => {vegetables} 0.1728972 0.6540404 0.2643525 1.095920
## [3] {eggs} => {vegetables} 0.1695594 0.6302730 0.2690254 1.056095
## [4] {lunch meat} => {vegetables} 0.1715621 0.6222760 0.2757009 1.042695
## [5] {cereals} => {vegetables} 0.1702270 0.6219512 0.2736983 1.042151
## [6] {flour} => {vegetables} 0.1595461 0.6207792 0.2570093 1.040187
## count
## [1] 264
## [2] 259
## [3] 254
## [4] 257
## [5] 255
## [6] 239
Looking at the results, we can see that all lift values are higher than one. So we can say that the rhs products are more likely to be bought together with the lhs products than if they were independent.
plot(rules, engine="plotly")
Changing the rhs to another product: ice cream rules calculation
In our data, vegetables is by far the most frequent item, so almost every rule points to it and no other patterns can be observed. Let’s therefore use another product as our rhs: I will take ice cream.
rules_ice_cream = apriori(
data = md,
parameter = list(supp = 0.01, conf = 0.4),
appearance = list(default = "lhs", rhs = "ice cream"),
control = list(verbose = F)
)
rules_ice_cream_table = inspect(rules_ice_cream, linebreak = FALSE)
## lhs rhs
## [1] {hand soap,spaghetti sauce,vegetables} => {ice cream}
## [2] {cereals,paper towels,sandwich loaves} => {ice cream}
## [3] {all- purpose,lunch meat,spaghetti sauce} => {ice cream}
## [4] {aluminum foil,pasta,spaghetti sauce} => {ice cream}
## [5] {dishwashing liquid/detergent,flour,paper towels} => {ice cream}
## [6] {aluminum foil,paper towels,soda} => {ice cream}
## [7] {aluminum foil,coffee/tea,soda} => {ice cream}
## [8] {aluminum foil,juice,milk} => {ice cream}
## [9] {aluminum foil,beef,yogurt} => {ice cream}
## [10] {aluminum foil,beef,vegetables} => {ice cream}
## [11] {aluminum foil,milk,toilet paper} => {ice cream}
## support confidence coverage lift count
## [1] 0.01001335 0.4054054 0.02469960 1.474023 15
## [2] 0.01001335 0.4838710 0.02069426 1.759317 15
## [3] 0.01001335 0.4054054 0.02469960 1.474023 15
## [4] 0.01001335 0.5000000 0.02002670 1.817961 15
## [5] 0.01001335 0.5000000 0.02002670 1.817961 15
## [6] 0.01001335 0.4838710 0.02069426 1.759317 15
## [7] 0.01134846 0.4594595 0.02469960 1.670559 17
## [8] 0.01001335 0.5000000 0.02002670 1.817961 15
## [9] 0.01001335 0.4545455 0.02202937 1.652692 15
## [10] 0.01802403 0.4576271 0.03938585 1.663897 27
## [11] 0.01134846 0.4358974 0.02603471 1.584889 17
Because there are fewer transactions of this type, I reduced the initial support value to 0.01.
Due to the small sample, there is no clear pattern in the results of the analysis.
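With only 8 or 9 supporting baskets per rule, high confidence can easily arise by chance. One way to probe this (a sketch, not part of the original analysis) is to attach additional quality measures from arules, such as a chi-squared statistic, via `interestMeasure()`:

```r
# Attach a chi-squared statistic to each ice cream rule; larger values
# indicate a stronger deviation from independence between lhs and rhs.
quality(rules_ice_cream)$chiSquared <-
  interestMeasure(rules_ice_cream, measure = "chiSquared",
                  transactions = md)

# Rules whose association is weakest by this criterion come first:
inspect(head(sort(rules_ice_cream, by = "chiSquared", decreasing = FALSE)))
```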
plot(rules_ice_cream, engine="plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rules_ice_cream, method="graph")
In this paper, I mainly used the Apriori method for association rules. Although the results are not very strong, I think association rules are an interesting method of data analysis.